ABSTRACT

Objective To build an effective co-reference resolution 
system tailored to the biomedical domain. 
Methods Experimental materials used in this study 
were provided by the 2011 i2b2 Natural Language 
Processing Challenge. The 2011 i2b2 challenge involves 
co-reference resolution in medical documents. Concept 
mentions have been annotated in clinical texts, and the 
mentions that co-refer in each document are linked by 
co-reference chains. Normally, there are two ways of 
constructing a system to automatically discover co-referent
links. One is to manually build rules for co-reference
resolution; the other is to use machine learning systems 
to learn automatically from training datasets and then 
perform the resolution task on testing datasets. 
Results The existing co-reference resolution systems are 
able to find some of the co-referent links; our rule based 
system performs well, finding the majority of the co- 
referent links. Our system achieved 89.6% overall 
performance on multiple medical datasets. 
Conclusions Manually crafting rules based on
observation of training data is a valid way to achieve
high performance in this co-reference resolution task for
the critical biomedical domain.
BACKGROUND

Co-reference resolution is the process of linking 
together concepts that refer to the same entity. The 
ability to have computers automatically find this 
type of relation in text documents is of interest to 
people in the field of artificial intelligence because 
it can lead to having systems that can summarize 
texts and answer questions posed about informa- 
tion contained within those documents. 1 2 
Automatic summaries and question answering 
systems could be of great value to personnel in the 
healthcare industry as well. Because of these possi- 
bilities, a Natural Language Processing Challenge 
was hosted in 2011 by the i2b2 (Informatics for 
Integrating Biology and the Bedside) in order to 
advance co-reference resolution technology for the 
field of automatic biomedical document analysis 
and understanding. Annotated data was provided 
by four institutions: Partners HealthCare, Beth 
Israel Deaconess Medical Center, the University of 
Pittsburgh, and the Mayo Clinic. These data 
include the original texts for medical documents, a 
concept file for each document that describes con- 
cepts mentioned in the texts, and chain files that 
identify manually created co-reference chains in 
each of the texts as an example of how chains are 
to look after processing. The concept mentions to 
be linked are nouns or descriptive phrases in the 
medical texts that represent people, actions, 
objects, or ideas and have been given types
accordingly. Two methods were adopted by the
hosts of the challenge to annotate the datasets in 
the i2b2 shared task. The first method is the i2b2 
style annotations that include five concept categor- 
ies: people, problems, tests, treatments, and pro- 
nouns. The other method used is ODIE (Ontology 
Development and Information Extraction) style 
annotations that include eight categories: disease or 
syndrome, sign or symptom, procedure, people, 
other, none, laboratory or test result, and anatom- 
ical site. Each type of concept mention will only 
co-refer with a concept mention of the same type, 
with the exception of pronouns that can co-refer 
with any type of mention. 3 This challenge was
divided into three tracks; the University of
Houston-Downtown (UHD) team participated
in two of these. The first track is to first find
markables, and then find co-reference between them.
The second track is to find co-reference relations 
between already marked concepts in the ODIE 
style of annotation. The third track is to find 
co-reference relations between already marked con- 
cepts in the i2b2 style of annotation. 

OBJECTIVE 

The aim of this study was to build an effective rule- 
based co-reference resolution system and compare 
its performance with that of some publicly available 
co-reference systems. For the critical biomedical 
domain, time and effort spent in building carefully- 
crafted rules could be a well justified necessity to 
achieve the desired performance required in prac- 
tice. To fully evaluate our approach, we conducted 
a comprehensive study that examined the perform- 
ance of three publicly available general purpose 
co-reference resolution systems. 

MATERIALS AND METHODS 

The data used in this challenge came in two sets, 
training and test data. The training data differs 
from the test data by having gold standard 
co-reference chains included with it that could be 
used as a guide for constructing co-reference rules, 
either by hand, or by machine learning algorithms. 
The datasets are from the four institutions named 
above. Each of the institutions provided over 1000 
documents 1 in which concept mentions and 
co-reference were marked by hand. The test data 
consisted of 323 documents marked using the i2b2 
annotation style, and 59 documents marked using 
the ODIE style annotations, all of which were 
taken from the pool of over 1000 documents
provided by the four institutions. Our method was to
test three well developed, publicly available
co-reference resolution systems, as well as an
algorithm constructed by our team, on the training
data provided by i2b2. The best performing system
was used to create output from the test data provided
by i2b2 at the end of the challenge. Because the specifications 
for co-reference resolution in the i2b2 challenge were well 
defined and the type of data provided is specific, 3 we adopted a 
rule-based approach for the system built in this study. For this
challenge, the UHD team participated only in the second and third
tracks. That means our method was designed only to find
co-reference chains in text documents when the gold standard
concepts are already given as input. It is important to note that
the publicly available systems are responsible for discovering
their own concept markables, whereas the UHD rule based
system is given the gold standard markables. For a more equal
evaluation and comparison of the four systems, the three publicly
available systems would need to accept the gold standard data as
input rather than rely on their own markable discovery. Attempts
to arrange this were made; however, since none of the systems has
a facility for inputting such data, and the source code for each
system is not available for adding one, it was impossible for the
UHD team to have the three systems utilize the gold standard data
as the rule based system does. When processing
the datasets provided by i2b2, the gold standard concept files 
that came with the data were used to mark the concepts in the 
text documents. The system was developed by examining a
sample of files from the pool of training data, 15 per dataset,
that we felt were representative of the data as a whole, and
constructing linking functions, or rules, based on observation.
The linking functions were checked across the unused training
dataset to see which rules worked and which did not.
The system consists of six components, and uses four data
sources to aid in creating co-referent links. The general architec- 
ture for our system is depicted in figure 1. 



Data input and access 

The first two routines in the system are made to read in the text 
being examined and the concepts that are to be linked from the 
files provided by i2b2. The document handler breaks the text 
into tokens using white space boundaries, with each space char- 
acter indicating the end of one word and the beginning of the 
next. The text is then stored in a two-dimensional array, where 
the first dimension is the line number, and the second dimen- 
sion is the word number. A representation of this operation is 
depicted in figure 2. 



The document handler controls access to this matrix and 
gives the system a way to easily find the location of the concepts 
in the text, and a way to search the words surrounding the con- 
cepts for information about the concept. The concept handler 
reads in each concept and stores it in an array giving each 
concept a number based on its position in the array. Each 
element in the array holds the start line, start word, end line, 
end word, type, and the text within each concept. The concept 
handler gives easy access to the attributes of each concept. An 
example of concept storage can be found in figure 3. 
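
As an illustration only, the two handlers described above might be sketched in Python as follows. The names tokenize_document, Concept, parse_concept, and concept_words are hypothetical, and the regular expression assumes the concept line layout shown in figure 3; the sketch also assumes a concept does not span more than one line when its words are looked up.

```python
import re
from dataclasses import dataclass


def tokenize_document(text):
    """Return a 2D list indexed as [line][word], splitting on whitespace."""
    return [line.split() for line in text.splitlines()]


@dataclass
class Concept:
    text: str
    start_line: int
    start_word: int
    end_line: int
    end_word: int
    type: str


# Assumed concept line layout: c="text" line:word line:word||t="type"
CONCEPT_RE = re.compile(r'c="(.*)" (\d+):(\d+) (\d+):(\d+)\|\|t="(.*)"')


def parse_concept(line):
    """Parse one concept line into a Concept record."""
    match = CONCEPT_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised concept line: {line!r}")
    text, s_line, s_word, e_line, e_word, ctype = match.groups()
    return Concept(text, int(s_line), int(s_word), int(e_line), int(e_word), ctype)


def concept_words(document, concept):
    """Look up the words of a concept that sits on a single line."""
    return document[concept.start_line][concept.start_word:concept.end_word + 1]


document = tokenize_document(
    "The patient was admitted with hypertension.\n"
    "The condition was presented along with tachycardia."
)
concepts = [
    parse_concept('c="hypertension" 0:5 0:5||t="problem"'),
    parse_concept('c="the condition" 1:0 1:1||t="problem"'),
]
print(concept_words(document, concepts[1]))
```

Each concept's number is simply its index in the concepts list, which mirrors the numbering by array position described above.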

Main linker 

The next routine in the algorithm is the main linker, which 
matches all the concepts that are not in the person category. 
Every concept that passes through this linker is compared to 
each of the other concepts of the same type in the document 
and links are recorded if they meet the programmed criteria. 
Decisions made by this linker are binary, meaning two concepts
either match or do not match. At this stage, every link that is
detected is kept, which means a concept can have links to many
concepts within the document, rather than the at most two links
that are characteristic of co-reference chains. The main linker uses string
matching, the UMLS 4 (Unified Medical Language System) data- 
base, and the WordNet 5 database to determine if two concepts 
might have the same meaning. The main linker traverses the 
concept list and runs each one through its set of rules, and 
stores detected links in a list of pairs that is organized later on 
in the chain builder. 

Non-personal pronoun match 

The first step with each concept is to check if it is a pronoun
type. If it is a pronoun type concept and the word is 'which' or
'that', it is linked to the concept that immediately precedes it if
fewer than two words separate the two concepts. Other pronouns
are mentioned in the data, but any rules written for them only
resulted in performance loss when tested across unused training
data; we were unable to build a reliable rule for any other
pronoun. Example: '... deep wound culture showed MRSA which is
sensitive to...'.
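
A minimal sketch of this rule is given below, under stated assumptions: the Mention record and the function name link_relative_pronouns are hypothetical, word positions are zero-based indices within a line, and the sketch only considers a pronoun and a preceding concept that sit on the same line.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    line: int
    start_word: int
    end_word: int
    type: str


def link_relative_pronouns(mentions):
    """Link 'which'/'that' pronouns to the immediately preceding concept."""
    links = []
    ordered = sorted(mentions, key=lambda m: (m.line, m.start_word))
    for previous, current in zip(ordered, ordered[1:]):
        if current.type != "pronoun":
            continue
        if current.text.lower() not in {"which", "that"}:
            continue
        if previous.line != current.line:
            continue  # assumption: cross-line cases are skipped in this sketch
        words_between = current.start_word - previous.end_word - 1
        if 0 <= words_between < 2:  # "fewer than two words" between them
            links.append((previous, current))
    return links


# deep(0) wound(1) culture(2) showed(3) MRSA(4) which(5) is(6) sensitive(7) to(8)
culture = Mention("deep wound culture", 0, 0, 2, "test")
mrsa = Mention("MRSA", 0, 4, 4, "problem")
which = Mention("which", 0, 5, 5, "pronoun")
links = link_relative_pronouns([culture, mrsa, which])
```

In the example sentence the pronoun 'which' directly follows 'MRSA', so the only link produced is between those two mentions.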

Be phrase match 

The next step with each concept is to check the type of the
concepts that immediately precede and follow the concept. If they
are of the same type, the text in between the two concepts is
examined and if it contains any words that indicate it is a 'be
phrase', the two concepts are linked because they are probably
saying 'something is something'. Words and phrases that are
commonly found in the 'be phrases' are stored in the rule
database, and were added to the database manually by the UHD
team based on observations of gold standard links. Example:
'Resolution of organism is methicillin-resistant
Staphylococcus...'.

Figure 1 Co-reference resolution
system architecture.

Figure 2 Representation of
document handler functionality.
Access the article online to view this
figure in colour.

The patient was admitted with hypertension.
The condition was presented along with tachycardia.

Document: [0][0] "The" [0][1] "patient" ... [0][5] "hypertension."
[1][0] "The" [1][1] "condition" ... [1][6] "tachycardia."

Match by meaning 

After the 'be phrase' match, the concepts are examined and 
linked by their meanings. First, the concepts are conditioned by 
filtering out what we refer to as 'common words'. These 
common words include conjunctions (and, or, as, but, etc), 
adjectives (large, blue, painful, etc), and pronouns (he, she, it, 
etc). The conjunctions and pronouns that are filtered out are 
chosen to be eliminated from the concept if they appear in the 
common words table of the rule database. Each of the words 
that appear in the common words table was manually placed 
there by the UHD team. Adjectives are detected by searches in 
the WordNet database. After elimination of the common words, 
any non-letter characters, such as punctuation and hyphens, are 
removed. After this conditioning, each concept is compared to
every other concept of the same type in the document in three
ways.

Head and synonym match 

First, every leftover word in the concept is compared to every 
leftover word in each of the other concepts of the same type in 
the document by a word comparison method. This word com- 
parison method will declare the words a match if the first 80% 
of the characters in the shorter word match the same number of 
characters in the longer word, or if they are found to be 
WordNet synonyms. If every word in one of the concepts is 
matched to a word in the other concept, a link between the two 
is recorded. Example: 'abscess' and 'abscesses'.
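
The word comparison can be sketched as below. This is an illustration under assumptions, not the system's code: the names words_match and concepts_match are hypothetical, the 80% prefix length is rounded up (the rounding is not stated above), and the WordNet lookup is replaced by a caller-supplied set of synonym pairs.

```python
def words_match(a, b, synonyms=frozenset()):
    """True if the first 80% of the shorter word matches the longer word,
    or if the pair is listed as synonyms (stand-in for a WordNet lookup)."""
    a, b = a.lower(), b.lower()
    shorter, longer = sorted((a, b), key=len)
    prefix_len = (4 * len(shorter) + 4) // 5  # ceil(0.8 * len), an assumption
    if prefix_len > 0 and shorter[:prefix_len] == longer[:prefix_len]:
        return True
    return frozenset((a, b)) in synonyms


def concepts_match(words_a, words_b, synonyms=frozenset()):
    """True if every word of one concept matches some word of the other."""

    def covered(source, target):
        return bool(source) and all(
            any(words_match(w, v, synonyms) for v in target) for w in source
        )

    return covered(words_a, words_b) or covered(words_b, words_a)


long_mention = ["recurrent", "soft", "tissue", "abscess", "gluteal", "region"]
short_mention = ["tissue", "abscesses"]
linked = concepts_match(long_mention, short_mention)
```

Because only one of the two concepts needs to be fully covered, the shorter mention decides the outcome and the extra words of the longer mention are ignored.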

UMLS match 

The second comparison is through the UMLS database. Both 
concepts are searched for in the MRCONSO table of the UMLS 
database after the conditioning, and if they are found in the 
database and their UMLS concept numbers match, a link 
between the two is recorded. Example: 'renal' and 'kidney' both
have C011773 for a concept number in the UMLS database.



Acronym match 

The third type of comparison is a check for acronyms. The first 
letters of each word in concepts that have two or more words 
are taken and are compared to whole words in other concepts, 
and if a whole word is found that matches either all the first 
letters, or some of them in order, a link is recorded. Example:
'Methicillin-resistant Staphylococcus aureus' and 'MRSA'.
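
One possible reading of the acronym check is sketched below. The function names are hypothetical, and several details are assumptions: hyphenated words are split so that each part contributes an initial, "some of them in order" is read as the candidate word being an in-order subsequence of the initials, and candidate words shorter than two letters are ignored.

```python
import re


def initials(concept_text):
    """First letters of a multi-word concept; empty for single words."""
    words = [w for w in re.split(r"[^A-Za-z]+", concept_text) if w]
    if len(words) < 2:
        return ""
    return "".join(w[0].lower() for w in words)


def is_subsequence(candidate, letters):
    """True if candidate's characters appear in letters in the same order."""
    remaining = iter(letters)
    return all(ch in remaining for ch in candidate)


def acronym_match(concept_a, concept_b, min_len=2):
    """True if a whole word of concept_b matches initials of concept_a."""
    letters = initials(concept_a)
    if not letters:
        return False
    for word in re.split(r"[^A-Za-z]+", concept_b):
        candidate = word.lower()
        if len(candidate) >= min_len and is_subsequence(candidate, letters):
            return True
    return False


matched = acronym_match("Methicillin-resistant Staphylococcus aureus", "MRSA")
```

A subsequence test this loose can also accept short ordinary words, which is one reason a later filtering stage is useful.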

After performing these steps, a phrase like 'Recurrent soft 
tissue abscess in the gluteal region' will link to 'tissue abscesses' 
because tissue is present in both mentions, abscess matches to 
abscesses by way of the head match, and since there are only 
two words in the second mention, all the other words in the 
first mention are ignored. 

People linker 

All concept types are processed through the same path in the
algorithm except for the mentions of type 'person' or 'people'.
These mentions are processed by the people linker. As with the 
main linker, all decisions made by this linker are binary. 

Identifying people mentions 

When the people linker is called to examine a document, it runs 
through several subroutines to identify 'person' type mentions 
as being doctors or the subject of the document. 

Medical personnel 

The first step performs internet searches on each concept 
mention. The mention being processed is sent to a search 
engine, and the results are scanned for certain key words to 
indicate if the mention is referring to a doctor or medical per- 
sonnel. Every mention that is found to be of medical personnel 
is stored in a list for later use. Example: when 'optometrist' is 
sent to the search engine, it returns many results like this: 

Doctors of Optometry and their Education | American 
Optometric... 

http://www.aoa.org/x5879.xml 

Doctors of Optometry and their Education. Doctors of optom- 
etry are the nation's largest eye care profession, serving patients 
in nearly 6500 communities across... 

These results are searched for keywords such as doctor, clinic, 
hospital, medical, etc. If two or more of those types of medical 
keywords are present, the mention is marked as being medical 
personnel. 
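
The keyword test applied to the search results can be sketched as follows. The search request itself is left out; the sketch assumes the result titles and snippets have already been retrieved as strings. The function name is hypothetical, the keyword tuple holds only the four examples named above (the full list is not given), and the sample snippets in the usage lines are invented for illustration.

```python
# Only the keywords named in the text; the system's full list is not given.
MEDICAL_KEYWORDS = ("doctor", "clinic", "hospital", "medical")


def looks_like_medical_personnel(snippets, keywords=MEDICAL_KEYWORDS, threshold=2):
    """True if at least `threshold` distinct keywords occur in the results.

    Matching is by substring, so 'doctor' also matches 'Doctors'.
    """
    text = " ".join(snippets).lower()
    found = {keyword for keyword in keywords if keyword in text}
    return len(found) >= threshold


# Hypothetical search result snippets for two mentions.
optometrist_results = [
    "Doctors of Optometry and their Education",
    "Find an eye clinic near you",
]
other_results = ["Family reunion photo gallery"]

is_medical = looks_like_medical_personnel(optometrist_results)
is_not_medical = looks_like_medical_personnel(other_results)
```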



Figure 3 Representation of concept
handler functionality. Access the
article online to view this figure in
colour.

RawConcepts:
c="hypertension" 0:5 0:5||t="problem"
c="the condition" 1:0 1:1||t="problem"

Concept:
[0] text - "hypertension", startLine - 0, endLine - 0,
startWord - 5, endWord - 5, type - "problem"
[1] text - "the condition", startLine - 1, endLine - 1,
startWord - 0, endWord - 1, type - "problem"
The subject (patient)

The second step is to find a name in the document to represent 
the subject of the document. The function checks each concept 
and whether it meets the following criteria: 

► It is not a pronoun. 

► It is not found to be a doctor according to the previous 
check. 

► It does not have the doctor salutation, Dr. 

► It has no medical title at the end, MD. 

► It does not contain common words stored in the rule
database, such as 'patient', or words that would indicate it is a
family member.

A concept that meets all of these criteria is marked as the
subject of the document. If no concept fits these criteria, the
first occurrence of a concept that says 'patient' or 'pt' is marked
as the subject, since the patient has been the subject of the
document in every document observed by the UHD team. After
finding an appropriate representation of the subject, every
concept that has the words 'patient' or 'pt' in it and no words
that refer to a family member is linked to the subject concept.

The subject's gender 

The third step is to find the gender of the subject. This function 
simply counts the number of masculine and feminine pronouns 
in the document; the type that is more frequent is declared to 
be the gender of the subject. 
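
The pronoun count can be sketched in a few lines. The pronoun lists and the function name subject_gender are assumptions made for this illustration, as is returning None on a tie, a case the description above does not cover.

```python
import re
from collections import Counter

# Assumed pronoun lists; the system's actual lists are not given.
MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}


def subject_gender(text):
    """Return 'male' or 'female' by pronoun frequency, or None on a tie."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    masculine = sum(counts[word] for word in MASCULINE)
    feminine = sum(counts[word] for word in FEMININE)
    if masculine > feminine:
        return "male"
    if feminine > masculine:
        return "female"
    return None


gender = subject_gender("She was admitted. Her pain improved. He examined her.")
```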

Matching people mentions 

After gathering information about the 'person' and 'people' 
type concept mentions, the algorithm moves on to actually 
create links between these mentions. 

Introduction match 

If two concepts are found to be no more than two words apart 
with one starting with a doctor salutation, or ending with a 
medical title, and the other was marked as referring to a doctor 
by the internet searches or by the database which stores words 
that identify mentions as medical personnel (eg, Attending), the 
two concepts are linked as this likely indicates an introduction 
of someone. Example: 'Please follow-up with your Optometrist, 
Dr Smith 2019-01-16 at 8:30'. 

Partial match 

After linking the introductions, a matching function is run that 
works the same way as the head matching function in the main 
linker. Certain words are removed from concepts, such as salu- 
tations, pronouns, titles, and single letters, as well as punctu- 
ation; they are then compared to each other. If all of the words, 
up to 80% of the length of the word, in each concept appear in 
the other concept, a link between them is recorded. This match 
will link people's names together, including those that appear 
with an initial for the first name in one instance and the full 
name in another. Example: 'You will see Edward L, Smith ... on 
your visit to Dr Smith's clinic'. 

Pronoun linking 

The next step in the people linker is to match third person pro- 
nouns to the names that refer to them. This is done by searching 
the sentence that contains pronoun concepts. 

Third person, no proper names in the sentence 

If the sentence has only pronoun mentions in it, each of the
pronouns in that sentence is linked to the subject concept if
it is of the same gender as the subject. If it is not the same
gender as the subject, the closest preceding concept that is not a
pronoun is linked to it.

Third person, with proper names in the sentence 

If there is one name in the sentence, and the name's position in
the sentence is before the pronoun, then the pronoun is linked to
that name. If there are multiple names in the sentence, any pronoun
that matches the gender of the subject is linked to the subject and
the others are linked to the first name in the sentence that is
found to be a doctor.

Other pronouns, including first and second person pronouns 

After this, any person concepts that are first person pronouns 
are linked together, and any second person pronouns are linked 
to the subject. The last step is to link any pronoun type mention 
that is the word 'this' to the next person mention if it is within 
three words of it and the next mention is not a doctor; then, 
any pronoun type mention that is the word 'who' is linked to 
the previous person type mention that is not any type of 
pronoun. 

Link filtering 

After the semantic links are made in the main linker, they are 
passed over to filters to eliminate links that actually refer to two 
different entities based on clues found in the sentences sur- 
rounding the mentions in question. These clues include descrip- 
tive phrases such as dates, locations, or descriptive modifiers not 
included in the span of the mention. These clues are found by 
using regular expressions for dates and keywords stored in the 
rule database for locations and descriptive modifiers compared 
by string matching. These clues are only searched for if the 
word preceding or following each mention is one of the key- 
words stored in the rule database. Examples include in, on, are, 
is, etc. The filter portion of the algorithm also eliminates links 
using WordNet; any mention that is found to be an adjective 
with no noun included has any links to it removed. 

Building the chains 

Once the linkers and the filter have finished their jobs, the final 
output is created from the 'web' of links that has been made. 
The first concept with links is found and each link is traversed 
to the next concept, and each of those links is followed in a 
recursive fashion. A list of each concept visited is kept, and 
though concepts can be linked more than one time, they are 
added only once to the list. After every link has been examined 
in the 'web', the list of concepts is sorted according to each con- 
cept's position in the text. Concepts that appear in the begin- 
ning of the text are at the top of the list. Once a chain is 
constructed, it is written to an output file in the i2b2 format. 
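
The chain builder amounts to collecting the connected groups of the link 'web'. A minimal sketch follows, assuming concepts are identified by their index and that a positions mapping gives each concept's (line, word) location; the names are hypothetical, and an explicit stack is used here in place of the recursion described above.

```python
from collections import defaultdict


def build_chains(links, positions):
    """Group linked concepts into chains sorted by position in the text."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    visited = set()
    chains = []
    for start in sorted(graph, key=positions.__getitem__):
        if start in visited:
            continue
        visited.add(start)
        stack, chain = [start], []
        while stack:
            node = stack.pop()
            chain.append(node)  # each concept is added only once
            for neighbour in graph[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        chains.append(sorted(chain, key=positions.__getitem__))
    return chains


positions = {0: (0, 1), 1: (0, 4), 2: (1, 2), 3: (2, 0), 5: (3, 3)}
links = [(0, 3), (3, 5), (1, 2), (5, 0)]
chains = build_chains(links, positions)
```

Concepts without any link never enter the graph, so they are left out of the chains produced by this sketch.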

RESULTS AND DISCUSSION 

There are a number of systems publicly available for 
co-reference resolution. For the purpose of comparison, we con- 
ducted experiments with three widely adopted systems — 
Beautiful Anaphora Resolution Toolkit (BART), the Stanford 
co-reference system, and LingPipe — and provide their perform- 
ance based on i2b2 testing data. Each system was evaluated in 
two ways. The first method was to compare each link with the 
provided co-reference chain annotations, and count it as correct 
only if it matches exactly with the provided annotation. With
this method, single unlinked concept mentions which are not
co-referent to any other mentions, called 'singletons', are not
considered, and links that fall in the same chain but skip an
antecedent are considered incorrect. An example of this can be
seen in figure 4.

Figure 4 Example of exact match
scoring. Highlighted sections are
concept mentions. Blue arrows are
correct links; red is counted as
incorrect even though it is a
co-reference because it skips a
mention. Access the article online to
view this figure in colour.

Recurrent soft tissue abscess in the gluteal region, Methicillin-...

Correct

and he said that there are multiple abscesses in her gluteal region, there is no indication...

Incorrect

Correct

that the patient has a recurrence of tissue abscesses with MRSA, mo...

This scoring method is referred to as 'exact match' scoring
and is a method we devised before receiving the i2b2 evaluation
script. This method was used because the i2b2 evaluation script
was not made available until late in the course of the challenge;
once in use, it also seemed to give a better representation of the
performance of the systems, as we could measure individual concept
type performance. The second method of evaluation is with a
script provided by i2b2 that conducts four types of examina- 
tions of the chain output for each system: B-Cubed, 6 MUC, 7 
BLANC, 8 and CEAF. 9 Since many of the concept mentions,
approximately 30-60% depending on the dataset, are singletons,
the i2b2 evaluation script will produce a much higher
score than the exact match method because it considers
singletons to be correct co-reference chains. Overall performance
results using both methods are listed at the end of this section.
The results are in the form of an F1 score, which is the
harmonic mean of precision and recall.
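
The exact match scoring described above can be sketched as follows. This is an illustration rather than the script used in the study: the function names are hypothetical, and mentions are assumed to be integer identifiers numbered in text order, so that a gold chain's correct links are the pairs of consecutive mentions.

```python
def adjacent_links(chains):
    """Links between consecutive mentions of each chain (singletons give none)."""
    return {(c[i], c[i + 1]) for c in chains for i in range(len(c) - 1)}


def exact_match_scores(gold_chains, predicted_links):
    """Precision, recall, and F1 where only antecedent-adjacent links count."""
    gold = adjacent_links([sorted(chain) for chain in gold_chains])
    predicted = {tuple(sorted(link)) for link in predicted_links}
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One gold chain of three mentions plus a singleton, which is ignored.
gold_chains = [[1, 4, 7], [2]]
# The link (1, 7) is co-referent but skips mention 4, so it counts as wrong.
predicted_links = [(1, 4), (1, 7)]
precision, recall, f1 = exact_match_scores(gold_chains, predicted_links)
```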

Beautiful Anaphora Resolution Toolkit 

BART was developed from a project done at the 2007 Johns 
Hopkins Summer workshop (http://www.bart-coref.org/). 10 
Once set up, text is sent to it through a web service, and output 
is returned in xml format. The output contains detected 
concept mentions and if they belong to a chain, the chain identi- 
fier is included in the xml tag of the concept mention. A transla- 
tor was created to compare the BART output to the chain files 
included with the input texts. Only concept mentions detected 
by the BART system and listed by the i2b2 annotations were 
considered for testing; all other mentions and co-referent links 
were discarded. Individual concept type linking scores using the 
exact match scoring are listed in table 1. The i2b2 evaluation 
script results for each training dataset are shown in table 2. 

Stanford co-reference system 

The Stanford co-reference system is an ongoing project by the
Stanford Natural Language Processing Group
(http://nlp.stanford.edu/software/dcoref.shtml). 11 It uses a
'multi-pass sieve' to perform co-reference resolution, which is a
layered approach to detecting links between mentions. It starts
with the strongest match first, then uses more and more relaxed
criteria for matches as it runs down the layers of co-referring
rules. Input was supplied as raw text in a string, and output from
this system comes in the form of a map stored in an array. Each
element of the array holds the location, in the form of line 
number and word number in the text, of a source mention, and 
a destination mention. A simple mapping function was con- 
structed to convert the Stanford concept locations to i2b2 
concept locations. Only concept mentions that were found by
the Stanford system and listed by the i2b2 annotations were
considered; all other mentions and co-referent links were
discarded. Individual concept type linking scores using the exact
match scoring are listed in table 1. The i2b2 evaluation script
results for each training dataset are shown in table 2.

LingPipe 

LingPipe is a suite of natural language processing tools provided 
by the Alias-i company as a commercial natural language pro- 
cessing product (http://alias-i.com/lingpipe). LingPipe performs 
co-reference resolution through a set of heuristic algorithms that 
link together mentions found by internal functions. 12 Input for 
the system was through command line functions specifying the 
location of the input text documents, and output was a text 
document containing xml tags surrounding discovered concept 
mentions and a chain identifier if the mention was found to be 
co-referent. A translator similar to the one used to map the 
BART system output was constructed to make the data useable 
in this study. Individual concept type linking scores using the 
exact match scoring are listed in table 1. The i2b2 evaluation 
script results for each training dataset are shown in table 2. 

Our system 

A rule based approach was chosen for the UHD algorithm strictly
because of the specific nature of the challenge. Machine learning
algorithms can perform as well as, if not better than, rule based
algorithms; however, they can take much more time to construct.
Rule based algorithms rely on human knowledge for their
performance rather than gathering their own information. For that
reason they can be quicker to build, but are less adaptable to
changes in the structure and types of data. Our system could
conceivably be used for other types of English texts if given
concept markables in the same style as the i2b2 data. It is not
restricted to only the types given for the challenge and will
attempt to process concepts of any type given to it. The algorithm
only looks for matching concept types before testing co-reference
between the concepts. Whether this algorithm would work well with
a type of document other than medical documents is as of now
untested. This algorithm did do well at adapting to new data
sources in the context of this competition. The University of
Pittsburgh training data was released near the close of the
challenge; the UHD algorithm showed similar performance on that
data, about 1% higher F1 score, to its performance on the data
that was being used to construct it. Individual concept type
linking scores using the exact match scoring are listed in
table 1. The i2b2 evaluation script results for each training
dataset are shown in table 2.

Table 1 Exact match F1 scores for the four systems on individual
concept mention types in the Beth Israel, Partners Healthcare, and
Mayo Clinic datasets across unused training data

System  Dataset              People  Problems  Test   Treatments  All others

        Beth Israel          0.958   0.690     0.389  0.597       N/A
        Partners Healthcare  0.953   0.696     0.462  0.624       N/A
        Mayo Clinic          0.593   0.667     N/A    0.500       0.453

        Beth Israel          0.590   0.202     0.166  0.300       N/A
        Partners Healthcare  0.475   0.206     0.253  0.263       N/A
        Mayo Clinic          0.410   0.000     N/A    0.000       0.000

        Beth Israel          0.205   0.076     0.000  0.096       N/A
        Partners Healthcare  0.251   0.073     0.074  0.061       N/A
        Mayo Clinic          0.069   0.000     N/A    0.000       0.000

        Beth Israel          0.243   0.015     0.029  0.092       N/A
        Partners Healthcare  0.139   0.067     0.088  0.066       N/A
        Mayo Clinic          0.071   0.000     N/A    0.000       0.000

BART, Beautiful Anaphora Resolution Toolkit; N/A, not applicable; UHD, University of
Houston-Downtown.

Table 2 i2b2 evaluation script overall F1 score results of the
unused training data for all four systems

System    Beth Israel  Partners Healthcare  Mayo Clinic

UHD       0.891        0.912                0.789
BART      0.775        0.712                0.436
Stanford  0.627        0.633                0.423
LingPipe  0.628        0.601                0.423

BART, Beautiful Anaphora Resolution Toolkit; UHD, University of Houston-Downtown.

Combining results 

Once result data were collected, combinations of link results 
from the rule based system and the BART system were examined 
since the BART system showed the highest amount of correct 
link predictions. After combining the results from the two 
systems as a union of the sets, the statistics showed an increase 
of about 1% in recall but a decline of about 15% in precision, 
bringing the F1 score down overall. The combination of our
system and BART was the only one attempted as it was felt that 
no better gain would be achieved from the other systems in 
union with ours since BART performed the best, and the time it 
takes to test the combinations would be better spent improving 
our own system. Since recall of the combination of the UHD 
and BART systems is only 1% higher, it can be said that the 
UHD system found nearly all of the correct co-referent links 
that the other systems found with a much higher precision. 
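
The effect of taking the union of two link sets can be illustrated with a toy example. All of the link sets below are invented for illustration and do not reproduce the study's data; the helper name prf is hypothetical.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 of a predicted link set against gold links."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: ten gold links, labelled ("g", 0) to ("g", 9).
gold = {("g", i) for i in range(10)}
# A precise system: eight correct links and one wrong link.
rule_based = {("g", i) for i in range(8)} | {("x", 0)}
# A noisier system: three correct links (one of them new) and four wrong ones.
other = {("g", 6), ("g", 7), ("g", 8)} | {("y", i) for i in range(4)}

p_rule, r_rule, f_rule = prf(gold, rule_based)
p_union, r_union, f_union = prf(gold, rule_based | other)
```

In this toy case the union gains one correct link, so recall rises slightly, but it also inherits every wrong link of the noisier system, so precision and the F1 score both fall.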

Challenge participation 

In order to participate in the challenge, each team participating 
was given test data that did not include the gold standard 
co-reference chains. After processing the data, each team sub- 
mitted the data for evaluation by the hosts of the challenge. The 
system used for our submission to the challenge was the rule 
based system constructed by the UHD team since it showed the 
highest performance on the unused training data. 

Challenge results 

The system the UHD team constructed had an F1 score average
of 0.895 on all of the datasets provided for the testing. This
score was the only score provided by the hosts of the
competition and represents the performance of the UHD system on all
of the datasets during the competition evaluation. According to
the hosts of the competition, our team ranked fourth in the
challenge, with the top four performing systems in a close
tie for first place. The highest performing system had an F1
score of 0.915.



CONCLUSION 

Since the goal of the 2011 i2b2 Natural Language Processing
task was to mark concept mentions as co-referent or not, the
rule based system developed for this study was used to mark
links in the test data released by the organization for the
challenge. This decision was made based on the results from
cross-checking the performance of each system on the training data
provided. The results show the BART system performed the best
out of the three publicly available co-reference systems tested in
this study on this specific collection of data. The results also
show that manually creating rules for co-reference based on
observation of training data is a valid way to accomplish this
co-reference task, particularly with the person type concepts in
the i2b2 style annotations, and in this case performed well using
the guidelines laid out by the hosts of the competition. The
results listed in this paper show that the rule based system
outperformed the three publicly available systems. This is because
the publicly available systems are general purpose systems
designed to detect co-reference of people and named entities and
must discover their own markables, whereas the UHD rule based
system was designed specifically for this challenge and these
markables. The public systems should be given credit, though, for
being able to detect co-referent links in this environment while
discovering their own markables. It is not a stretch to imagine
that these systems took a fair amount of time to develop, and they
can perform in many situations, whereas the UHD rule based system
will operate only in the context of i2b2 or ODIE marked documents,
which represent a variety of clinical reports from different
institutions. Development costs can be higher for machine learning
algorithms, like BART and the Stanford systems. However, in
specific contexts such as this competition, high performance can
be achieved with the lower cost rule based algorithms. The UHD
rule based algorithm could be used, theoretically, in any domain
as long as concepts are annotated in one of the two styles used in
this challenge.